If you have worked on this assignment in a group, then only a single submission is required from the group, but you should note your collaborators at the start of the document.
We want to find the most in-demand jobs around the world right now, to see which countries/continents lack which sets of skills, and to use the results as a general guideline for future students making relevant and practical study choices. Other questions we can ask: is there a vast difference between countries, and is there a bias in the choices made by the populations of these countries?
To answer this question, we could follow this manual process:
On the “The Most In-Demand Careers Around the World” article, the page contains a list of jobs:
Identify the job titles based on the headings
We know that the required URL to get the components from is https://www.westernunion.com/blog/jobs-in-demand/, and as we scroll down, we are able to identify the job title headings, as below:
Choose the headings we want to get information from (i.e., scrape)
In the next section, we are going to scrape the job titles from the Western Union blog article. To do so, we must know how the page structures its documents and what hints we can extract from it to make it easier to identify the parts of the page we need. We also want to extract the ‘Countries in Demand’ section, so we have to find the correct CSS selectors for it.
As we look for the CSS selector for the ‘Countries in Demand’ section, we have to work out how to exclude the other paragraphs, so that scraping the ‘Countries in Demand’ element does not also return all of the irrelevant paragraph (p) elements on the page.
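The idea can be sketched on a small, hypothetical page (the markup below is an invented stand-in for the blog post’s structure, not the real page): selecting every `p` returns noise, while an XPath index (or an `:nth-of-type` CSS selector) pins down exactly one paragraph.

```r
library(rvest)

## A hypothetical, minimal page with several paragraphs, only one of
## which we actually want -- mimicking the blog post's structure:
page <- read_html('
  <div class="post-content">
    <p>Some introductory text.</p>
    <p>Countries in Demand: France, Germany, Japan</p>
    <p>More unrelated text.</p>
  </div>')

## Selecting every <p> picks up the irrelevant paragraphs too:
html_text(html_nodes(page, "div.post-content > p"))

## An XPath index predicate narrows it down to just the one we want:
html_text(html_nodes(page, xpath = "//div[@class='post-content']/p[2]"))
## [1] "Countries in Demand: France, Germany, Japan"
```

The same indexing trick is what the `p[5]` and `p[9]` selectors below rely on.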
The parts of the ‘Countries in Demand’ section that we need to scrape (compact interface shown)
This is how we get data using the Chrome Developer Tools
We start by defining a few key components (the required libraries, starting URL, and key XPath/CSS selectors):
library(rvest)
## Loading required package: xml2
blog.url <- "https://www.westernunion.com/blog/jobs-in-demand/"
doc.url <- read_html(blog.url)
list.jobtitles <- html_nodes(doc.url, "#main > div.row.row-flex.post-single > div > div > div.row.post-data > div > div.post-content > h2")
raw.data <- html_text(list.jobtitles)
clean.jobtitles.data <- gsub("[\n]", "", raw.data)
clean.jobtitles.data
## [1] "Software Engineers and Developers" "Mechanical & Civil Engineers"
## [3] "IT Analysts" "Accountants"
## [5] "Surveyors & Skilled Construction " "Teachers "
## [7] "Chefs " "Pharmacists "
## [9] "Psychologists"
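Some of the titles above still carry trailing whitespace (e.g., `"Teachers "`). A small, optional tidy-up with base R’s `trimws()` would remove it; the vector below is just a sample of the output above:

```r
## trimws() strips leading and trailing whitespace (base R, no packages):
raw.titles <- c("Surveyors & Skilled Construction ", "Teachers ", "Chefs ")
trimws(raw.titles)
## [1] "Surveyors & Skilled Construction" "Teachers"
## [3] "Chefs"
```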
However, for the second part we have to exclude the other paragraphs.
countries.selector <- "//*[@id='main']/div[2]/div/div/div[2]/div/div[3]/p[5]"
list.paragraphs <- html_nodes(doc.url, xpath=countries.selector)
raw.paragraphs.data <- html_text(list.paragraphs)
raw.paragraphs.data
## [1] "Countries in Demand: Belgium, Czech Republic, Denmark, Estonia, France, Germany, Greece, Ireland, Israel, Luxembourg, Netherlands, Russia, Slovak Republic, Sweden, United Kingdom, Canada, Mexico, United States, Brazil, Chile, Japan, Korea, Australia, New Zealand"
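The scraped result is a single string; to analyse it further we would want a character vector of country names. A sketch using base R’s `sub()` and `strsplit()` (shown on an abbreviated copy of the string above):

```r
raw <- paste("Countries in Demand: Belgium, Czech Republic, Denmark,",
             "France, Germany")

## Drop the "Countries in Demand:" label, then split on commas:
countries <- strsplit(sub("^Countries in Demand:\\s*", "", raw), ",\\s*")[[1]]
countries
## [1] "Belgium"        "Czech Republic" "Denmark"        "France"
## [5] "Germany"
```

Note that multi-word names such as “Czech Republic” survive intact because we split on commas rather than spaces.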
We repeat this for the ‘Countries in Demand’ paragraph of another job title; only the paragraph index in the XPath changes (here p[9] instead of p[5]):
countries.selector <- "//*[@id='main']/div[2]/div/div/div[2]/div/div[3]/p[9]"
list.paragraphs <- html_nodes(doc.url, xpath=countries.selector)
raw.paragraphs.data <- html_text(list.paragraphs)
raw.paragraphs.data
## [1] "Countries in Demand: Denmark, Belgium, Estonia, Norway, Switzerland, Turkey, Finland, Greece, Iceland, Netherlands, Russia, Slovak Republic, Spain, United Kingdom, Hungary, Luxembourg, Sweden, Germany, Ireland, Canada, Mexico, United States, Brazil, Chile, Australia, New Zealand "
We start by defining a few key components (the required libraries, starting URL, and key XPath/CSS selectors):
library(rvest)
search.url <- "https://www.seek.co.nz/jobs?keywords=%22data+science%22"
job.selector <- "//article[@data-automation='normalJob']" ## note: XPath selector!
title.selector <- "h1 a"
As we’re going to be scraping multiple pages, and the source links will be scraped from our start document, we need to use a session to keep track of the relevant details:
doc <- html_session(search.url)
And now, we should retrieve our required list of advertised jobs:
jobs <- html_nodes(doc, xpath=job.selector)
cat("Fetched", length(jobs), "results\n")
## Fetched 20 results
Note that we used the xpath parameter of the html_nodes() function instead of a normal CSS selector. This is because we are selecting by attribute value, which cannot easily be done with a CSS selector.
Having found the 20 jobs, we need to extract the required hyperlinks that lead us to the job description pages:
job.links <- html_nodes(jobs, title.selector)
job.href <- html_attr(job.links, "href")
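The two-step pattern above (narrow to the job nodes, then pull each link’s href attribute) can be illustrated on a small, made-up page; the markup below is an invented stand-in for Seek’s live results, not its real structure:

```r
library(rvest)

## A hypothetical results page with two job articles, mimicking the
## data-automation attribute and heading links we rely on above:
page <- read_html('
  <article data-automation="normalJob"><h1><a href="/job/1">A</a></h1></article>
  <article data-automation="normalJob"><h1><a href="/job/2">B</a></h1></article>')

jobs <- html_nodes(page, xpath = "//article[@data-automation='normalJob']")
html_attr(html_nodes(jobs, "h1 a"), "href")
## [1] "/job/1" "/job/2"
```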
Now, we perform the iteration over the href attributes that we discovered, extract the relevant page elements, and add them into a results list (steps 2 a, b, and c in our previously defined workflow).
location.selector <- "dl > dd:nth-child(4) > span > span > strong"
job.locations <- NULL ## a container for our results, starts off empty
for (job in job.href) {
  job.loc <- tryCatch({
    ## follow the link to the job description page
    job.doc <- jump_to(doc, job)
    job.loc <- html_node(job.doc, location.selector)
    html_text(job.loc)
  }, error=function(e) NULL)
  ## add the next location to our results vector
  job.locations <- c(job.locations, job.loc)
}
job.locations
## [1] "Wellington" "Wellington" "Wellington" "Wellington" "Canterbury"
## [6] "Wellington" "Auckland" "Wellington" "Auckland" "Auckland"
## [11] "Canterbury" "Canterbury" "Canterbury" "Manawatu" "Auckland"
## [16] "Taranaki" "Auckland" "Auckland" "Auckland" "Auckland"
Note the use of the tryCatch() function here - this is to handle any cases where following a link may produce an error (e.g., a 404 error for a broken link referring to a missing page).
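The error-handling pattern is easy to see in isolation: tryCatch() evaluates its body and, if an error is raised, returns the handler’s value instead of stopping the loop. A self-contained sketch (the stop() call stands in for a failed page fetch):

```r
## tryCatch() runs its body; on error, the handler's value is returned
## instead -- here NULL, which c() silently drops from a vector.
safe.value <- tryCatch({
  stop("simulated 404")   ## stands in for a broken link
}, error = function(e) NULL)

is.null(safe.value)
## [1] TRUE
```

Because `c(x, NULL)` is just `x`, failed pages simply contribute nothing to job.locations rather than corrupting it.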
Our job.locations vector is now complete, and should contain an entry for every job advertised. Now, we can create a tally of these locations and plot them using a suitable method (e.g., a bar plot):
tally <- table(job.locations)
barplot(tally, main="Location of Data Science Jobs on Seek", ylab="# Jobs Found", col="#00508f")
The analysis done here suggests that the majority of data science jobs are in the North Island and centred around either Auckland or Wellington.
The analysis here is rather simple - we have only really considered a single search for jobs on a single web site. To be more rigorous, we should examine multiple web sites, and maybe attempt to consider other terms often confused with or related to “data science” (e.g., “predictive analytics”, “data mining”, “machine learning”).